[SPARK-40883][CONNECT] Support Range in Connect proto #38347
amaliujia wants to merge 2 commits into apache:master from amaliujia:add_range
Conversation
R: @cloud-fan

Can one of the admins verify this patch?
end is not optional, but how do we know if the client forgets to set it? 0 is a valid end value as well.
Yeah, this becomes tricky. Ultimately we could wrap every such field into a message so we always know whether the field is set or not. However, that might complicate the entire proto too much. Let's have a discussion on that.
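For illustration, a minimal sketch of that wrapping idea. The End wrapper message here is hypothetical (this PR does not wrap end); the point is that message-typed fields in proto3 carry presence, so the server can detect a client that forgot to set the field:

```scala
// Hypothetical proto shape (NOT what this PR does):
//   message End { int32 end = 1; }
//   message Range { int32 start = 1; End end = 2; ... }
// Message-typed fields get a generated has* accessor, unlike bare scalars,
// so the planner could reject a Range whose `end` was never set:
def resolveEnd(range: proto.Range): Int = {
  if (!range.hasEnd) {
    throw new IllegalArgumentException("Range.end must be set")
  }
  range.getEnd.getEnd
}
```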
we can call session.leafNodeDefaultParallelism
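A sketch of what that fallback could look like in SparkConnectPlanner, assuming the presence-carrying num_partitions wrapper field from this PR (accessor names are assumptions about the generated API):

```scala
// Fall back to the session default when the client did not set
// num_partitions; has*/get* names assume the generated wrapper message.
val numPartitions: Int =
  if (rel.hasNumPartitions) rel.getNumPartitions.getNumPartitions
  else session.leafNodeDefaultParallelism
```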
PR should be ready for review again.
How about we call spark.range(10).toDF? Then we don't need to add comparePlansDatasetLong.
Let me try it and see if it gives the exact same plan.
Another idea: we could compare the results through collect(), so we do not compare plans in this case.
Oh, .toDF() just converts things into a DataFrame.
comparePlansDatasetLong has been removed.
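For illustration, the two options discussed above side by side; connectAnalyzed and connectDf are placeholder names standing in for the test suite's plumbing, and comparePlans comes from catalyst's PlanTest base:

```scala
// Option A: spark.range(...).toDF() yields a DataFrame, so the analyzed
// plans can be compared directly, with no Dataset[Long]-specific helper.
val sparkDf = spark.range(0, 10, 1).toDF()
comparePlans(connectAnalyzed, sparkDf.queryExecution.analyzed)

// Option B: skip plan comparison entirely and compare results instead.
assert(connectDf.collect() sameElements sparkDf.collect())
```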
Conflict resolved
    end: Int,
    step: Option[Int],
    numPartitions: Option[Int]): Relation = {
  val range = proto.Range.newBuilder()
Note that I need to keep proto.Range here, as Range itself is a built-in Scala class, so we need the proto. prefix to differentiate in this special case.
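The clash in a nutshell: an unqualified Range resolves to scala.Range, which is always in scope, so the generated message has to keep its qualifier. The builder setters below assume the generated API for the int32 fields shown later in this diff:

```scala
val scalaRange: Range = 0 until 10 // scala.Range, always in scope
val protoRange: proto.Range =      // generated Connect message
  proto.Range.newBuilder().setStart(0).setEnd(10).build()
```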
I've explicitly requested this a couple of times already as a coding style: always prefix the proto-generated classes with their proto. prefix. I know it uses a little more horizontal space, but it always makes clear where a particular element comes from, which is tremendously useful because we consistently use the different types from the catalyst API and Spark Connect in the same code paths.
It makes sense for SparkConnectPlanner, where Catalyst and proto classes are mixed together, and we are keeping the approach you asked for there.
However, this is the Connect DSL, which only deals with protos; no Catalyst is included in this package.
As long as no Catalyst is in this package, this is good with me. Thanks for clarifying.
Is this really the best way to express the optionality?
There are two dimensions in this area:

- Required versus optional: a required field must be set; an optional field may or may not be set.
- Default value or not: a field can have a default value that applies when it is not set.

The second point builds on the first: when an optional field is not set, a default value may be used in its place.
In some special cases the default value in the proto is the same as the default value Spark uses, and then we don't need to differentiate the optionality. Otherwise we need this way to differentiate set versus not set, so we can adopt Spark's default values (unless we don't care about Spark's defaults).
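Concretely, a sketch of how the DSL's Option parameters from this diff could map onto presence-carrying fields. The nested Step/NumPartitions message names are assumptions about the generated API, not confirmed by the diff:

```scala
def range(
    start: Int,
    end: Int,
    step: Option[Int],
    numPartitions: Option[Int]): proto.Relation = {
  val range = proto.Range.newBuilder()
  range.setStart(start)
  range.setEnd(end)
  // Only set the wrapper messages when the caller provided a value; an
  // unset message field is observably absent (has* returns false).
  step.foreach(s =>
    range.setStep(proto.Range.Step.newBuilder().setStep(s)))
  numPartitions.foreach(n =>
    range.setNumPartitions(
      proto.Range.NumPartitions.newBuilder().setNumPartitions(n)))
  proto.Relation.newBuilder().setRange(range).build()
}
```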
To really answer your question: if we want to respect Spark's default values for those optional fields whose proto default values differ from Spark's defaults, this is the only way to do it.
So, in fewer words :) when num_partitions is a scalar integer, reading the field yields 0 even when it was never set, and for scalar types we can't tell present from absent. Having to guess whether 0 is a valid value or just "unset" defeats the purpose.
Thanks for the additional color!
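To make that ambiguity concrete, suppose num_partitions were a bare int32 instead of a wrapper message (a hypothetical alternative to what this PR does):

```scala
// Hypothetical bare-scalar field: int32 num_partitions = 4;
val unset        = proto.Range.newBuilder().build()
val explicitZero = proto.Range.newBuilder().setNumPartitions(0).build()
// Both read back as 0, and proto3 scalars generate no has* accessor,
// so "0 partitions" and "use the default" are indistinguishable.
assert(unset.getNumPartitions == explicitZero.getNumPartitions)
```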
thanks, merging to master!
  int32 start = 1;
  int32 end = 2;
  // Optional. Default value = 1
  Step step = 3;
Yes, let me follow up. I guess I was looking at the Python-side API and confused myself on the types.
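For reference, a sketch of why the int32 types above are too narrow: Spark's existing range API is Long-based, so start/end values can exceed the int32 range.

```scala
// SparkSession.range signature (existing Spark API):
//   def range(start: Long, end: Long, step: Long, numPartitions: Int):
//     Dataset[java.lang.Long]
val df = spark.range(0L, 10000000000L, 1L, 8) // end does not fit in an int32
```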
Closes apache#38347 from amaliujia/add_range.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
1. Support `Range` in Connect proto.
2. Refactor `SparkConnectDeduplicateSuite` to `SparkConnectSessionBasedSuite`.
Improve API coverage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT